BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations
Abstract
Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to perturbations of sounds. Meanwhile, the BYOL-A encoder combines local and global features and calculates their statistics so that the representation provides multi-aspect information. As a result, the learned representations should provide robust features and multi-aspect information to serve the various needs of diverse tasks. We evaluated the general audio task performance of BYOL-A compared to previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the BYOL-A encoder architecture contributes most of the performance, and the final critical portion comes from the BYOL framework and the BYOL-A augmentations. Our code is available online for future studies.
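The two mechanisms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The loss and the moving-average update follow the general BYOL recipe (an online network predicts the target network's projection of a differently augmented view, and the target weights slowly track the online weights). The function names, the value of `tau`, and the choice of temporal mean plus temporal max as the "statistics" are assumptions made for illustration; the abstract does not specify them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """BYOL-style loss: squared distance between unit-normalized vectors.

    Equals 2 - 2 * cosine_similarity, so it is 0 when the two views agree
    in direction and 4 when they point in opposite directions.
    """
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(q, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights toward online weights by a factor (1 - tau).

    tau=0.99 is an illustrative value, not one taken from the paper.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


def temporal_stats(frames):
    """Summarize T frame-level feature vectors as one clip-level vector.

    ASSUMPTION: the statistics are the per-dimension mean and max over
    time, summed together. The abstract only says "statistics".
    """
    num_frames = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / num_frames for d in range(dims)]
    peak = [max(f[d] for f in frames) for d in range(dims)]
    return [m + p for m, p in zip(mean, peak)]


if __name__ == "__main__":
    # Two augmented views whose embeddings point the same way: loss near 0.
    print(byol_loss([1.0, 0.0], [2.0, 0.0]))
    # Clip-level summary of two 2-dimensional frames.
    print(temporal_stats([[1.0, 2.0], [3.0, 0.0]]))
```

Because the loss depends only on the direction of the two vectors, minimizing it pushes the embeddings of two augmented views of the same sound to agree, which is the invariance the abstract describes.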
Related resources
Tahakum: A Multi-Purpose Audio Control Framework
We present “Tahakum”, an open source, extensible collection of software tools designed to enhance workflow on multichannel audio systems within complex multi-functional research and development environments. Tahakum aims to provide critical functionality required across a broad spectrum of audio systems usage scenarios, while at the same time remaining sufficiently open as to easily support mod...
Feature Representations for Neuromorphic Audio Spike Streams
Event-driven neuromorphic spiking sensors such as the silicon retina and the silicon cochlea encode the external sensory stimuli as asynchronous streams of spikes across different channels or pixels. Combining state-of-art deep neural networks with the asynchronous outputs of these sensors has produced encouraging results on some datasets but remains challenging. While the lack of effective spi...
Watermarking parametric representations for synthetic audio
This paper proposes to watermark parametric representations for synthetic audio. Our watermark system combines quantization index modulation at the encoder and maximum likelihood parameter estimation at the decoder. To guarantee error-free data hiding under expected types of attacks, knowledge of Fisher information and Cramér-Rao bounds is applied to the system design. Experiments show that, me...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2023
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2022.3221007